knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# Default packages
library(tidyverse)
library(here)
library(broom)
# Time series packages
library(tsibble)
library(feasts)
library(fable)
# Spatial data packages
library(sf)
library(tmap)
For Lab 5, you will work through the following tutorial on your own (i.e., there is not an additional recording). This is because once you leave Bren, you will learn most new skills through written tutorials, books, and blog posts - so we want you to feel familiar and confident with the expected post-school format of continued learning.
Remember to reach out on our course Slack channel if you get stuck - we'll be ready to help! Have fun forecasting with time series data, and exploring & visualizing some spatial data.
Fork the lab 5 repo from GitHub, then clone to create a local version-controlled R Project. The project contains the required data in a data subfolder, and the keys in the keys subfolder. The keys are there for reference if you get stuck - but it is very important for learning and retention that you try following along on your own first, troubleshooting as needed, before you use the key for help.
Add a new subfolder (called my_code or something) where you’ll save your R Markdown documents following along with the instructions below.
To reinforce skills for wrangling, visualizing, and forecasting with time series data, we will use data on US residential energy consumption from January 1973 - October 2017 (from the US Energy Information Administration).
Packages used in this part: tidyverse, tsibble, feasts, fable.

Read in the energy.csv data (use here(), since it's in the data subfolder).
energy <- read_csv(here("data", "energy.csv"))
Explore the energy object as it currently exists. Notice that there is a column month that contains the month name and 4-digit year. Currently, however, R understands that as a character (instead of as a date). Our next step is to convert it into a time series data frame (a tsibble), in two steps:

1. Add a new column (date) that converts the existing month column to a time series class (yearmonth), using tsibble::yearmonth()
2. Convert the data frame to a tsibble with as_tsibble(), setting that new date column as the time index
Here’s what that looks like in a piped sequence:
energy_ts <- energy %>%
  mutate(date = tsibble::yearmonth(month)) %>%
  as_tsibble(key = NULL, index = date)
Now that it's stored as a tsibble, we can start visualizing, exploring, and working with it a bit more easily.
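For example, because the index is now recognized as dates, we can do time-aware wrangling. A quick sketch (filter_index() and index_by() are tsibble functions; the object names here are just for illustration):

```r
# Keep only observations from January 2000 onward:
energy_recent <- energy_ts %>%
  filter_index("2000 Jan" ~ .)

# Summarize to annual totals (index_by() is the time-aware group_by()):
energy_annual <- energy_ts %>%
  index_by(year = year(date)) %>%
  summarise(annual_total = sum(res_total))
```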
Exploratory data visualization is critical no matter what type of data we’re working with, including time series data.
Let’s take a quick look at our tsibble (for residential energy use, in trillion BTU):
ggplot(data = energy_ts, aes(x = date, y = res_total)) +
  geom_line() +
  labs(y = "Residential energy consumption \n (Trillion BTU)")
Looks like there are some interesting things happening. We should ask: Is there an overall trend? Is there seasonality? Any cyclicality or notable outliers?

The big ones to notice quickly here are an overall increasing trend, and clear, strong seasonality.
A seasonplot can help point out seasonal patterns, and help to glean insights over the years. We’ll use feasts::gg_season() to create an exploratory seasonplot, which has month on the x-axis, energy consumption on the y-axis, and each year is its own series (mapped by line color).
energy_ts %>%
  gg_season(y = res_total) +
  theme_minimal() +
  labs(x = "month",
       y = "residential energy consumption (trillion BTU)")
This is really useful for us to explore both seasonal patterns, and how those seasonal patterns have changed over the years of this data (1973 - 2017). What are the major takeaways from this seasonplot?
Let’s explore the data a couple more ways:
energy_ts %>% gg_subseries(res_total)
Our takeaway here is similar: there is clear seasonality (higher values in winter months), with an increasingly evident second peak in June/July/August. This reinforces our takeaways from the raw data and seasonplots.
See Rob Hyndman's section on STL decomposition to learn how it compares to the classical decomposition we did last week: "STL is a versatile and robust method for decomposing time series. STL is an acronym for 'Seasonal and Trend decomposition using Loess', while Loess is a method for estimating nonlinear relationships."
Notice that it allows seasonality to vary over time (a major difference from classical decomposition, and important here since we do see changes in seasonality).
# Find the STL decomposition
dcmp <- energy_ts %>%
  model(STL(res_total ~ season()))

# View the components
# components(dcmp)

# Visualize the decomposed components
components(dcmp) %>%
  autoplot() +
  theme_minimal()
We use the ACF to explore autocorrelation (here, we would expect seasonality to be clear from the ACF):
energy_ts %>%
  ACF(res_total) %>%
  autoplot()
And yep, we see that observations separated by 12 months are the most highly correlated, reflecting the strong seasonality we see in all of our other exploratory visualizations.
Note: here we use ETS, which technically uses different optimization than Holt-Winters exponential smoothing, but is otherwise the same (From Rob Hyndman: “The model is equivalent to the one you are fitting with HoltWinters(), although the parameter estimation in ETS() uses MLE.”)
To create the model below, we specify the model type (exponential smoothing, ETS), then tell it what type of seasonality it should assume using the season("") expression, where “N” = non-seasonal (try changing it to this to see how unimpressive the forecast becomes!), “A” = additive, “M” = multiplicative. Here, we’ll say seasonality is multiplicative due to the change in variance over time and also within the secondary summer peak:
# Create the model:
energy_fit <- energy_ts %>%
  model(
    ets = ETS(res_total ~ season("M"))
  )
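If we want to see the estimated smoothing parameters and information criteria for the fitted model, report() (a fabletools function, loaded with fable) prints a summary. A quick sketch:

```r
# Print the fitted ETS model's parameter estimates and AIC/BIC:
energy_fit %>% report()
```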
# Forecast using the model 10 years into the future:
energy_forecast <- energy_fit %>%
  forecast(h = "10 years")

# Plot just the forecasted values (with 80 & 95% CIs):
energy_forecast %>%
  autoplot()

# Or plot it added to the original data:
energy_forecast %>%
  autoplot(energy_ts)
We can use broom::augment() to append the model's predicted energy usage values (and residuals) to our original tsibble. Let's do a little exploring through visualization.
First, use broom::augment() to get the predicted values & residuals:
# Append the predicted values (and residuals) to original energy data
energy_predicted <- broom::augment(energy_fit)
# Use View(energy_predicted) to see the resulting data frame
Now, plot the actual energy values (res_total), and the predicted values (stored as .fitted) atop them:
ggplot(data = energy_predicted) +
  geom_line(aes(x = date, y = res_total)) +
  geom_line(aes(x = date, y = .fitted), color = "red")
Cool, those look like pretty good predictions!
Now let’s explore the residuals. Remember, some important considerations: Residuals should be uncorrelated, centered at 0, and ideally normally distributed. One way we can check the distribution is with a histogram:
ggplot(data = energy_predicted, aes(x = .resid)) +
  geom_histogram()
We see that this looks relatively normally distributed, and centered at 0 (we could find summary statistics beyond this to further explore).
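We could also check whether the residuals are uncorrelated. One sketch, reusing the ACF tools from above (gg_tsresiduals() is a feasts function that bundles a residual time plot, ACF, and histogram into one diagnostic figure):

```r
# ACF of the residuals; ideally most spikes fall within the significance bounds:
energy_predicted %>%
  ACF(.resid) %>%
  autoplot()

# Or a combined residual diagnostic plot for the fitted model:
energy_fit %>%
  gg_tsresiduals()
```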
This is the END of what you are expected to complete for Part 1 on time series exploration and forecasting. Section E, below, shows how to use other forecasting models (seasonal naive and autoregressive integrated moving average, the latter of which was not covered in ESM 244 this year).
There are a number of other forecasting methods and models! You can learn more about ETS forecasting, seasonal naive (SNAIVE) and autoregressive integrated moving average (ARIMA) from Hyndman’s book - those are the models that I show below.
# Fit 3 different forecasting models (ETS, ARIMA, SNAIVE):
energy_fit_multi <- energy_ts %>%
  model(
    ets = ETS(res_total ~ season("M")),
    arima = ARIMA(res_total),
    snaive = SNAIVE(res_total)
  )

# Forecast 3 years into the future (from the data end date):
multi_forecast <- energy_fit_multi %>%
  forecast(h = "3 years")

# Plot the 3 forecasts:
multi_forecast %>%
  autoplot(energy_ts)

# Or just view the forecasts (note the similarity across models):
multi_forecast %>%
  autoplot()
We can see that all three of these models (exponential smoothing, seasonal naive, and ARIMA) yield similar forecasting results.
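To compare the models more formally, we might hold out the last few years, refit on the remaining training window, and compute accuracy metrics on the held-out data with fabletools::accuracy(). Treat this as an outline rather than a definitive workflow (the cutoff date here is an arbitrary choice):

```r
# Train on data through December 2014, test on the remainder:
energy_train <- energy_ts %>%
  filter_index(. ~ "2014 Dec")

energy_fit_train <- energy_train %>%
  model(
    ets = ETS(res_total ~ season("M")),
    arima = ARIMA(res_total),
    snaive = SNAIVE(res_total)
  )

# Forecast over the test window and compare against the actual data:
energy_fit_train %>%
  forecast(h = "3 years") %>%
  accuracy(energy_ts)
```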
In the Week 5 lecture, you learned a bit more about projection and coordinate reference systems, types of spatial data, and investigating spatial autocorrelation using variograms. We’ll practice working with spatial data this week, then move on to variograms, spatial interpolation and point pattern analysis (exploring spatial clustering) next week.
Today, we’ll use vector data (polygons, points) to practice reading in spatial data, checking & updating the CRS, and doing some wrangling and visualization.
We'll use several datasets: a California counties shapefile (TIGER 2016), and red sesbania observation records from CA DFW.
First, let's read in the California county shapefile using read_sf():

ca_counties <- read_sf(here("data", "ca_counties", "CA_Counties_TIGER2016.shp"))
Use View(ca_counties) to check out what it contains. Let’s simplify it by only keeping two attributes: NAME (county name) and ALAND (land area), then renaming those to county_name and land_area.
ca_subset <- ca_counties %>%
  select(NAME, ALAND) %>%
  rename(county_name = NAME, land_area = ALAND)
Take a look at ca_subset. We should notice something very important about a simple features (sf) object: it just assumes you want to keep the spatial information, and you can work with the rest of the data as if it’s a non-spatial data frame (and the spatial information just “sticks” - hence the term “sticky geometry”). So even though we only called NAME and ALAND in the select() function, we see that the geometry column still exists!
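To see that stickiness in action, here is a quick sketch: the geometry column survives further wrangling, too. (The unit conversion below assumes land_area is in square meters, which is worth confirming against the TIGER metadata.)

```r
# The geometry column stays attached through mutate() and select():
ca_subset %>%
  mutate(land_area_km2 = land_area / 1e6) %>%  # assumes land_area is in m^2
  select(county_name, land_area_km2)
```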
Use st_crs() to check the existing CRS for spatial data. We see that this CRS is WGS 84 / Pseudo-Mercator (EPSG: 3857).
ca_subset %>% st_crs()
## Coordinate Reference System:
## User input: 3857
## wkt:
## PROJCS["WGS 84 / Pseudo-Mercator",
## GEOGCS["WGS 84",
## DATUM["WGS_1984",
## SPHEROID["WGS 84",6378137,298.257223563,
## AUTHORITY["EPSG","7030"]],
## AUTHORITY["EPSG","6326"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4326"]],
## PROJECTION["Mercator_1SP"],
## PARAMETER["central_meridian",0],
## PARAMETER["scale_factor",1],
## PARAMETER["false_easting",0],
## PARAMETER["false_northing",0],
## UNIT["metre",1,
## AUTHORITY["EPSG","9001"]],
## AXIS["X",EAST],
## AXIS["Y",NORTH],
## EXTENSION["PROJ4","+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs"],
## AUTHORITY["EPSG","3857"]]
Plot the California counties using geom_sf(). Notice that we can update aesthetics just like we would for a regular ggplot object. Here, we update the fill color based on land area (and change the color gradient).
ggplot(data = ca_subset) +
  geom_sf(aes(fill = land_area), color = "white", size = 0.1) +
  theme_void() +
  scale_fill_gradientn(colors = c("cyan", "blue", "purple"))
Red sesbania (Sesbania punicea) is an invasive plant (see more information from the California Invasive Plants Council). Observations for locations of invasive red sesbania are from CA DFW. See metadata and information here: https://map.dfg.ca.gov/metadata/ds0080.html
The data exist in the data/red_sesbania subfolder, and the shapefile is stored as ds80.shp. Let's read in the data:
sesbania <- read_sf(here("data", "red_sesbania", "ds80.shp"))

# Check the CRS:
sesbania %>% st_crs()
## Coordinate Reference System:
## No user input
## wkt:
## PROJCS["Custom",
## GEOGCS["GCS_North_American_1983",
## DATUM["North_American_Datum_1983",
## SPHEROID["GRS_1980",6378137.0,298.257222101]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4269"]],
## PROJECTION["Albers_Conic_Equal_Area"],
## PARAMETER["False_Easting",0.0],
## PARAMETER["False_Northing",-4000000.0],
## PARAMETER["longitude_of_center",-120.0],
## PARAMETER["Standard_Parallel_1",34.0],
## PARAMETER["Standard_Parallel_2",40.5],
## PARAMETER["latitude_of_center",0.0],
## UNIT["Meter",1.0]]
Notice that this CRS is different from the California counties CRS, so we’ll want to update it to match. Use st_transform() to update the CRS:
sesbania <- st_transform(sesbania, 3857)
# Then check it:
sesbania %>% st_crs()
## Coordinate Reference System:
## User input: EPSG:3857
## wkt:
## PROJCS["WGS 84 / Pseudo-Mercator",
## GEOGCS["WGS 84",
## DATUM["WGS_1984",
## SPHEROID["WGS 84",6378137,298.257223563,
## AUTHORITY["EPSG","7030"]],
## AUTHORITY["EPSG","6326"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4326"]],
## PROJECTION["Mercator_1SP"],
## PARAMETER["central_meridian",0],
## PARAMETER["scale_factor",1],
## PARAMETER["false_easting",0],
## PARAMETER["false_northing",0],
## UNIT["metre",1,
## AUTHORITY["EPSG","9001"]],
## AXIS["X",EAST],
## AXIS["Y",NORTH],
## EXTENSION["PROJ4","+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs"],
## AUTHORITY["EPSG","3857"]]
Cool, now they have the same CRS.
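A small tip: instead of hard-coding the EPSG number, we can pull the CRS directly from the counties object, which avoids mismatches if that CRS ever changes. A sketch:

```r
# Transform sesbania to whatever CRS ca_subset uses:
sesbania <- st_transform(sesbania, st_crs(ca_subset))
```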
Now let's plot the sesbania points atop the county polygons. Note: this may take a minute.
ggplot() +
  geom_sf(data = ca_subset) +
  geom_sf(data = sesbania, size = 1, color = "red")
Let’s say we want to find the count of red sesbania observed locations in this dataset by county. How can I go about joining these data so that I can find counts? Don’t worry…st_join() has you covered for spatial joins!
ca_sesbania <- ca_subset %>%
  st_join(sesbania)
And then we can find counts (note: these are not counts for individual plants, but by record in the dataset) by county:
sesbania_counts <- ca_sesbania %>%
  count(county_name)
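One caution: st_join() here is a left join, so a county containing no sesbania records still appears as a single row (with NA values in the sesbania attribute columns), and count() will report it as 1 instead of 0. A sketch of one way to get true zeros, using the COUNTY column from the sesbania attributes as the NA indicator (note this also unions each county's geometry):

```r
# Count only rows where the join actually matched a sesbania record:
sesbania_counts_zero <- ca_sesbania %>%
  group_by(county_name) %>%
  summarise(n = sum(!is.na(COUNTY)))
```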
Then we can plot a choropleth using the number of records for red sesbania as the fill color (instead of what we used previously, land area):
ggplot(data = sesbania_counts) +
  geom_sf(aes(fill = n), color = "white", size = 0.1) +
  scale_fill_gradientn(colors = c("lightgray", "orange", "red")) +
  theme_minimal() +
  labs(fill = "Number of S. punicea records")
So we see that we can still use our usual wrangling skills! Let's do a bit more for fun, just to prove that our existing wrangling skills still work with spatial data - the spatial information just sticks with it! We'll plot only the county with the greatest number of red sesbania records (Solano), and make a map of those locations (yes, there are many ways to do this):
# Subset of sesbania point locations only in Solano County
solano_sesbania <- sesbania %>%
  filter(COUNTY == "Solano")

# Only keep the Solano polygon from the California county data
solano <- ca_subset %>%
  filter(county_name == "Solano")

ggplot() +
  geom_sf(data = solano) +
  geom_sf(data = solano_sesbania)
Sometimes we’ll want to make a map interactive so that audience members can zoom in, explore different areas, etc. We can use the {tmap} package to create an interactive map. Let’s make one for our California counties (fill aesthetic by land area) with the red sesbania locations on top:
# Set the viewing mode to "interactive":
tmap_mode(mode = "view")

# Make a map with the polygon fill color based on 'land_area' (palette "BuGn"),
# then add another shape layer for the sesbania records (added as dots):
tm_shape(ca_subset) +
  tm_fill("land_area", palette = "BuGn") +
  tm_shape(sesbania) +
  tm_dots()
See all kinds of other cool ways you can update your interactive tmaps.
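One housekeeping note: tmap_mode() is a global toggle, so if you want later maps in your document to render as static figures again, switch back:

```r
# Return to static ("plot") mode for subsequent maps:
tmap_mode(mode = "plot")
```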